Complete Analysis of Their Eyes Were Watching God

For my creative project I will write some code that analyzes Their Eyes Were Watching God sentence by sentence. The first thing I will do is estimate the sentiment of every sentence in the book. The sentiment of a sentence is a number between $0$ and $1$ that describes how positive or negative the sentence is: values of $0$, $0.5$, and $1$ map to negative, neutral, and positive respectively. Using this we can visualize the progression of sentiment throughout the book and map events in the story onto it.
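
As a quick illustration of that scale (a minimal sketch; the 0.4 and 0.6 cutoffs are arbitrary choices for this example, not something the classifier below uses), a score can be bucketed into a label like this:

def label_sentiment(score):
    # Map a sentiment score in [0, 1] to a coarse label.
    # The 0.4 / 0.6 cutoffs are illustrative only.
    if score < 0.4:
        return 'negative'
    elif score > 0.6:
        return 'positive'
    return 'neutral'

print label_sentiment(0.12)  # negative
print label_sentiment(0.93)  # positive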

We start by loading the text file containing the book and splitting it into sentences with NLTK's Punkt sentence tokenizer.


In [15]:
import nltk.data

# Read the full text of the novel from disk.
with open('D:\\Temp\\Their-Eyes-Were-Watching-God-rmrju9.txt', 'r') as content_file:
    content = content_file.read()

# Use NLTK's pre-trained Punkt model to split the text into sentences.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
tokens = sent_detector.tokenize(content)
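
As a quick sanity check (a sketch; the exact count depends on the particular text file used), we can peek at what the sentence tokenizer produced:

print 'number of sentences:', len(tokens)
print 'first sentence:', tokens[0]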

The next part is a little tricky. Essentially, we train a classifier on NLTK's movie-review corpus and wrap it in a function that, given a sentence, returns a sentiment probability.


In [32]:
import numpy as np

from scipy.optimize import minimize

import sklearn.linear_model as ln
import nltk.classify.util

from nltk.classify import SklearnClassifier
from nltk.corpus import movie_reviews
import nltk.corpus as corpus
import nltk.tokenize as tokenize
import nltk.data as data

class GeneralTokenizerTools():
    """Helpers for splitting text into sentences/words with stopwords removed."""
    def __init__(self):
        self.stopwords = set(corpus.stopwords.words('english'))
        self.sent_tokenizer = data.load('tokenizers/punkt/english.pickle')
    def tokenize_remove_stopwords_sentence(self, sentence):
        # Split on whitespace and drop common English stopwords.
        return [x for x in tokenize.WhitespaceTokenizer().tokenize(sentence) if x not in self.stopwords]
    def tokenize_remove_stopwords_sentences(self, sentences):
        return [self.tokenize_remove_stopwords_sentence(x) for x in sentences]
    def tokenize_by_sentence(self, text):
        return self.tokenize_remove_stopwords_sentences(self.sent_tokenizer.tokenize(text))

def word_feats(words):
    # Bag-of-words feature dictionary: every word present maps to True.
    return dict([(word, True) for word in words])

# Labeled training data: NLTK's movie review corpus (0 = negative, 1 = positive).
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 0) for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 1) for f in posids]

# Hold out 1% of each class for a quick accuracy check.
cut = .99
negcutoff = int(len(negfeats) * cut)
poscutoff = int(len(posfeats) * cut)

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

# Logistic regression over the bag-of-words features, via NLTK's scikit-learn wrapper.
classifier = SklearnClassifier(ln.LogisticRegression(), sparse=False).train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)

tool = GeneralTokenizerTools()

def predict(x):
    # Return the classifier's probability distribution over {0: negative, 1: positive}
    # for a single sentence.
    return classifier.prob_classify(dict([(y, True) for y in tool.tokenize_remove_stopwords_sentence(x)]))


train on 1980 instances, test on 20 instances
accuracy: 0.95
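
To make the interface concrete, here is how predict is used on a single sentence (the example sentence is made up; the exact probabilities depend on the trained classifier):

dist = predict("She was happy and free at last.")
print 'P(positive):', dist.prob(1)
print 'P(negative):', dist.prob(0)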

Next, some smoothing utilities that make the plots nicer to look at: an exponential-smoothing function whose smoothing factor is fit with scipy's optimizer, and a simple moving average that bins nearby sentences together.


In [76]:
import matplotlib.pylab as plt

%pylab inline
def exponential_smoothing(x):
    # Classic exponential smoothing: y[i] = alpha * x[i-1] + (1 - alpha) * y[i-1].
    def y(alpha, x):
        y = np.empty(len(x), float)
        y[0] = x[0]
        for i in xrange(1, len(x)):
            y[i] = x[i - 1] * alpha + y[i - 1] * (1 - alpha)
        return y

    # Mean absolute percentage error of the smoothed series against the original.
    def mape(alpha, x):
        diff = y(alpha, x) - x
        return np.mean(np.abs(diff / x))

    # Pick the smoothing factor alpha in [0, 1] that minimizes the error.
    guess = .5
    result = minimize(mape, guess, (x,), bounds=[(0, 1)], method='L-BFGS-B')
    print result
    return y(result.x, x)

def moving_average(x, y, step_size=.1, bin_size=1):
    # Average the y values whose x coordinate falls inside a sliding window
    # of width bin_size, evaluated every step_size units.
    if x is None:
        x = np.arange(0, len(y), 1)
    y = np.array(y)
    bin_centers = np.arange(np.min(x), np.max(x) - 0.5 * step_size, step_size) + 0.5 * step_size
    bin_avg = np.zeros(len(bin_centers))

    for index in range(0, len(bin_centers)):
        bin_center = bin_centers[index]
        items_in_bin = y[(x > (bin_center - bin_size * 0.5)) & (x < (bin_center + bin_size * 0.5))]
        bin_avg[index] = np.average(items_in_bin)

    return bin_centers, bin_avg


Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['plt']
`%matplotlib` prevents importing * from pylab and numpy
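
exponential_smoothing is defined above but not actually used in the plots that follow; for reference, this is how it would be called (a sketch with made-up values, and any series without zeros works, since the error term divides by the original values):

raw = np.array([0.2, 0.6, 0.4, 0.8, 0.7, 0.9])
smoothed = exponential_smoothing(raw)  # prints the optimizer result, returns the smoothed series
print smoothed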

We classify the sentiment of each sentence and then smooth the results. For example, if we have 5 sentences near each other and 4 of them are positive, the smoothing will effectively treat the 5th one as positive too (even if it is negative). This makes the overall trend much easier to read.
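
To see what this smoothing does on a toy example (a sketch with made-up numbers: four strongly positive sentiments around one negative one), the moving average blends the dip together with its neighbours so the curve is far less extreme:

toy = [0.9, 0.8, 0.1, 0.85, 0.9]
centers, smoothed = moving_average(None, toy, step_size=1, bin_size=3)
# The middle value is no longer 0.1; it is averaged with the sentences around it.
print smoothed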


In [67]:
# Approximate page count of the edition used; only needed to label the x-axis.
page_count = 218
# Probability that each sentence is positive, smoothed over a 100-sentence window.
a_pos,b_pos = moving_average(None,[predict(x).prob(1) for x in tokens],bin_size=100)

The final stage is to visualize the smoothed sentiment, rescaled to the range $[0,1]$, against the page number.


In [73]:
plt.title("Sentiment of each Sentence")
plt.xlabel("Page")
plt.ylabel("Probability of Sentence Being Positive")
# Rescale the smoothed values to [0, 1] and spread them evenly across the page count.
plt.plot(np.arange(0,page_count,float(page_count)/len(b_pos)), (b_pos - min(b_pos))/(max(b_pos)-min(b_pos)))
plt.show()


Analysis of Sentiment

It is interesting to note that there are roughly three chunks (excluding everything up to about page 30, which is the foreword) that correspond to the men the protagonist meets throughout the book. It seems counterintuitive that the most positive but most sporadic chunk occurs when she is with Logan, while the most consistent chunk is with Tea Cake.

The next thing we will do is follow the trends of independence and love and how they correlate with the theme of a free woman throughout the book. To do this we check every individual sentence in the book for a handful of key words and estimate how densely those sentences occur.


In [165]:
from scipy import stats

words_of_interest = ["independence","independent","free","love","like","happy","woman"]

def exists(x):
    # Return 1 if the sentence contains any of the key words, otherwise 0.
    for i in words_of_interest:
        if i in x:
            return 1
    return 0

# Indices of the sentences that mention at least one key word.
a = np.array([x for x in range(len(tokens)) if exists(tokens[x]) == 1])
# Kernel density estimate of where in the book those sentences cluster.
kde = stats.gaussian_kde(a)

plt.xlabel("Sentence")
plt.ylabel("Average Occurrence")
# Plot the negative log of the density over the first 5000 sentences.
plt.plot(range(5000),[-np.log(kde(x) * 1000) for x in range(5000)])
plt.show()


Analysis of Trends

In this part of my project I analyze the words that correlate with the major themes of the book. It is interesting to note that the protagonist's low point is in the middle of the book, while she reaches her peak near the end of the book with Tea Cake. It is also interesting that her levels of independence and love drop after the death of Tea Cake.

